Introduction

Submitted by: Susan Bataju

"Chest X-Ray Images (Pneumonia)" from Kaggle was chosen for Lab 2. The dataset contains 5,863 validated chest X-ray images split into two categories (Pneumonia and Normal), selected from pediatric patients one to five years old at Guangzhou Women and Children's Medical Center, Guangzhou [2]. The dataset was first collected, organized, and analyzed in "Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification" [1] by Kermany, D., Zhang, K., et al. Images are labeled as (disease)-(randomized patient ID)-(image number for that patient). Bacterial and viral pneumonia infections are mixed together in the Pneumonia set.

The purpose of collecting this dataset is to detect pneumonia in patients using their chest X-rays. A high-accuracy classifier that can detect pneumonia from X-ray images would be revolutionary for doctors around the world. Uses for such a classifier range from verifying doctors' assessments to reducing their workload, or serving remote places with fewer qualified human resources.

The medical industry would have a major business interest in such a classifier. Someone's life and well-being might depend on the accuracy of the model, so it should arguably achieve an accuracy greater than 98%, though the required threshold depends on the use case. Even a few percent of inaccuracy, when used at large scale, can have adverse effects on users.

The images are of various sizes and aspect ratios. To produce uniformly sized images and avoid distortions like stretching when changing the aspect ratio, all the images are cropped. If the height of an image is greater than its width, the top and bottom portions are cropped by equal amounts until the height and width are equal; when the width is greater than the height, the left and right portions are cropped instead. After cropping, the images have an aspect ratio of 1, so they can be scaled without any distortion. Generally, the cropped image contains most of the lungs, while the empty parts on the left and right and the head and lower portions are cropped away, retaining the most useful information.
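This cropping rule can be sketched in a few lines of NumPy (a minimal sketch; the helper name `center_crop_square` is illustrative, not the notebook's actual function):

```python
import numpy as np

def center_crop_square(img):
    """Crop equal amounts from the longer sides so the result is square.

    Taller than wide: trim the top and bottom.
    Wider than tall: trim the left and right.
    """
    h, w = img.shape[:2]
    side = min(h, w)
    top = (h - side) // 2
    left = (w - side) // 2
    return img[top:top + side, left:left + side]

# e.g. a 10x6 image is trimmed to its central 6x6 square
img = np.arange(60).reshape(10, 6)
square = center_crop_square(img)
```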

While most of the images are grayscale, about 280 images in the pneumonia set are RGB; those images are converted to grayscale before use.

PCA, DAISY, and Gabor filters are tested as feature-extraction methods. The performance of these methods is evaluated with a KNeighborsClassifier.

The following are the first few images in the normal and pneumonia datasets. We can see that the images are not the same size or aspect ratio.

Below is a large grid of ~600 images from the normal dataset. They are cropped as described in the introduction.

Data Preparation and Resizing

The following code crops the images to produce square images, which are then scaled to 120×120. RGB images are converted to grayscale.
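A minimal sketch of this preprocessing step, assuming scikit-image is available (the function name and the grayscale-by-averaging shortcut are illustrative, not necessarily the notebook's exact code):

```python
import numpy as np
from skimage.transform import resize

def prepare_image(img, size=120):
    """Center-crop to a square, then scale to size x size grayscale."""
    if img.ndim == 3:                      # RGB -> grayscale (simple channel average)
        img = img.mean(axis=2)
    h, w = img.shape
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    img = img[top:top + side, left:left + side]
    return resize(img, (size, size), anti_aliasing=True)

# a 200x150 RGB image becomes a 120x120 grayscale array
out = prepare_image(np.random.rand(200, 150, 3))
```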

Below is a large grid of ~600 images from the pneumonia dataset. They are cropped as described in the introduction and grayscaled as well.

The following shows the final result of resizing and cropping on the normal dataset.

The following code shows how the RGB images are converted to grayscale.
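A sketch of the conversion using the standard ITU-R BT.601 luma weights (the notebook's actual code may differ, e.g. a plain channel average):

```python
import numpy as np

def rgb_to_gray(img):
    """Convert an (H, W, 3) RGB array to grayscale via BT.601 luma weights."""
    return img[..., :3] @ np.array([0.299, 0.587, 0.114])

gray = rgb_to_gray(np.ones((4, 4, 3)))   # a pure-white RGB patch stays white
```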

Before resizing and cropping the images, the following histogram shows the distribution of the original image dimensions. We should look at images with a high aspect ratio and make sure they are cropped properly.

As a basic image-quality filter: images with an aspect ratio greater than 1.5 and with height and width both less than 600 pixels are all very zoomed in and not suitable for analysis. Those images are dropped.

Dropping the images with an aspect ratio greater than 1.5 and height and width less than 600 pixels removes the images that look too zoomed in.
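The filtering rule can be sketched as follows (the helper name `is_too_zoomed` is illustrative):

```python
def is_too_zoomed(height, width, max_ratio=1.5, min_side=600):
    """Flag images whose aspect ratio exceeds max_ratio while both
    dimensions are under min_side pixels -- these tend to be heavy crops."""
    ratio = max(height, width) / min(height, width)
    return ratio > max_ratio and height < min_side and width < min_side

# keep only images that pass the filter
sizes = [(400, 250), (700, 450), (500, 500)]
keep = [s for s in sizes if not is_too_zoomed(*s)]
```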

These box plots show how much of the data consists of outliers; we can see the pneumonia set has the largest outliers. We can test removing these outlier images because, after cropping, they are most likely too zoomed in.

Here, looking at some images with an aspect ratio greater than 1.5, we see that most of them still show a significant portion of the lungs.

After resizing, we can see below that all the images are the same size and shape. Also, from the figures above, the cropped images look much better after the outliers are removed. But there are still images where the lungs are partially cropped, as in the fifth image above.

Each image is reshaped into a (1, 14400) row vector, i.e., the rows of the image are joined end to end into a single vector. The image vectors, classes, and names are then saved in an array. Another test that could be done is increasing the size of the images.

The following shows all the images saved as row vectors: pixel position is on the x-axis, and the different image samples are on the y-axis.

We can get the image back by reshaping the row vector into a (120, 120) array, as shown below.
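The flatten-and-restore round trip can be sketched as:

```python
import numpy as np

img = np.random.rand(120, 120)          # one preprocessed X-ray
vec = img.reshape(1, 14400)             # rows joined end to end into a row vector
restored = vec.reshape(120, 120)        # reshape to recover the original image
```

Because `reshape` only changes the view of the data, the round trip is exact.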

Now, make the 2D arrays.

Let's test whether standardizing the 2D array has an effect in later stages.
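A minimal sketch of the standardization step with scikit-learn's `StandardScaler` (the stand-in data is random, for illustration only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 14400)          # stand-in for the flattened image matrix
X_std = StandardScaler().fit_transform(X)
# every pixel column now has mean 0 and standard deviation 1
```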

Data Reduction

Now, let's run PCA on the dataset. The code below is adapted from Machine Learning Notebooks [3].
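A minimal sketch of fitting both the full and randomized PCA solvers with scikit-learn (random stand-in data; `n_components=30` is an illustrative choice, not necessarily the notebook's):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 14400)                     # stand-in for the image matrix
pca_full = PCA(n_components=30, svd_solver='full')
pca_rand = PCA(n_components=30, svd_solver='randomized', random_state=42)

Z_full = pca_full.fit_transform(X)
Z_rand = pca_rand.fit_transform(X)
print(pca_full.explained_variance_ratio_[:3])      # per-component variance share
```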

The explained variance from the first principal component is greater than 90%, capturing most of the variance.

The eigenvectors encode important features: the 1st to 11th "eigenxrays" show the locations of the lungs, heart, and some rib features, and in the 12th to 17th we can see the ribs as well. Below we can see the first and last principal components with class labels, where 0 is normal and 1 is pneumonia. On the first principal component, all the normal data points are less than 0.

Now, let's show the low-dimensional representation of the images given by PCA. Note that the location and color intensity of the lungs are preserved, but most of the detail we can see by eye is lost.
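The reconstruction uses `PCA.inverse_transform` to map the reduced vectors back to pixel space; a minimal sketch on random stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 14400)                      # stand-in image matrix
pca = PCA(n_components=30, svd_solver='randomized', random_state=0)
Z = pca.fit_transform(X)                            # reduced representation
X_rec = pca.inverse_transform(Z)                    # back to pixel space
images_rec = X_rec.reshape(-1, 120, 120)            # viewable reconstructions
```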

Testing the standardized 2D array.

It looks like standardizing the values changed the light-colored pixels more: in the 3rd eigenxray the light and dark regions are swapped, the final reconstructed image is darker as well, and the PCA300 variable also does not separate the classes as well as the unstandardized PCA.

The eigenxrays are very similar to the full PCA's and tell a similar story: ribs appear after the 9th eigenxray, and the lungs and heart appear before them.

Same conclusion as the full PCA: most of the variance is explained by the first principal component.

Plot all the above reconstructions side by side.

The randomized PCA and full PCA both perform comparably; later we will see a small improvement in classifier accuracy when using full PCA.

Feature Extraction

Feature extraction on the dataset is done using DAISY. Three different architectures are considered.

The image descriptions for the above architectures are shown below.
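Extracting DAISY descriptors with scikit-image can be sketched as follows (the three parameter tuples are illustrative, not necessarily the notebook's exact architectures):

```python
import numpy as np
from skimage.feature import daisy

img = np.random.rand(120, 120)          # stand-in for a preprocessed X-ray

# three candidate architectures: (step, radius, rings, histograms)
for step, radius, rings, hists in [(20, 20, 2, 4), (10, 15, 3, 8), (25, 30, 1, 4)]:
    desc = daisy(img, step=step, radius=radius, rings=rings,
                 histograms=hists, orientations=8)
    # each grid cell holds a descriptor of length (rings*histograms + 1) * 8
    print(desc.shape)
```

Sparser steps and fewer rings give far shorter feature vectors, which matters downstream for the nearest-neighbor search.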

First, let's test how full PCA, randomized PCA, and DAISY features perform with a KNeighborsClassifier. Two hyperparameters of the KNeighborsClassifier, n_neighbors (the number of neighbors to use) and p (the power parameter for the Minkowski metric), are varied to find the most accurate combination. The number of neighbors is varied from 1 to 9, and p over 1, 2, and 4.
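The hyperparameter sweep can be sketched as follows (random stand-in features for illustration; the real run uses the PCA and DAISY feature matrices):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 30))               # stand-in for PCA-reduced features
y = rng.integers(0, 2, 200)             # 0 = normal, 1 = pneumonia
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# score every (n_neighbors, p) combination on the held-out split
results = {
    (k, p): KNeighborsClassifier(n_neighbors=k, p=p)
            .fit(X_tr, y_tr).score(X_te, y_te)
    for k in range(1, 10) for p in (1, 2, 4)
}
best = max(results, key=results.get)    # (n_neighbors, p) with highest accuracy
```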

We can see that full PCA and randomized PCA are much more accurate than the DAISY model. Furthermore, n_neighbors = 5 with p = 1 gives the most accurate results. Full PCA and randomized PCA perform at about the same rate, but full PCA is slightly better.

Following is the confusion matrix for the best-performing model (full PCA with n_neighbors=5 and p=1). Note: the model accuracy fluctuates a bit.
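Computing a confusion matrix with scikit-learn, shown here on toy labels for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 1, 0])   # toy labels: 0 = normal, 1 = pneumonia
y_pred = np.array([0, 1, 1, 1, 0, 0])
cm = confusion_matrix(y_true, y_pred)   # rows = true class, columns = predicted
print(cm)
```

For a medical classifier, the off-diagonal cells matter most: the bottom-left cell counts pneumonia cases predicted as normal (false negatives).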

In the following code, different DAISY architectures are tested.

We can see that when the rings are separated and sparse, DAISY seems to perform better than when it has many rings.

Now, matching the full DAISY vectors gives the following results. It selects images that do look similar by eye, as seen below.

And the following shows an image-matching comparison between randomized PCA and DAISY.

Following is the result of keypoint matching. The key of the dict is the index in the total dataset, X; the first item in the value is the index of the image in X[y==0] or X[y==1], i.e., in the normal-only or pneumonia-only dataset; and the second value is the percentage of matching keypoints.

For the normal dataset, the two normal images were selecting the same images; after removing this same-image bug, we have the following result. Now let's show the respective images. The image matching is done separately for the two classes, so each image has two matches: one from the normal class and another from the pneumonia class.

For the normal images, the colors shown are just the default colormap; the images are still grayscale.

We can see the third row shows a 100% match in DAISY feature keypoints, and the second image in that row looks like the same picture just turned to one side compared to the first image. The last row does not look like it matches any part of the first image.

Now for pneumonia images.

For the pneumonia images, we see the bottom three are perfect matches, and the second image by eye does look like some transformation of the first image. This is doing better than the normal images.

Due to time constraints and limited computational capacity, running over all the images is not feasible at the moment.

Here, I just wanted to apply a Sobel filter to one of the images. The third image has inverted colors compared to the second.
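A minimal sketch of applying the Sobel filter with scikit-image (a synthetic edge image stands in for the X-ray here):

```python
import numpy as np
from skimage.filters import sobel

img = np.zeros((120, 120))
img[:, 60:] = 1.0                       # synthetic vertical edge
edges = sobel(img)                      # gradient-magnitude edge map
inverted = edges.max() - edges          # inverted-color version for display
```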

A computational limitation here: applying the Gabor filter took 560 minutes! The notebook restarted after running the following code, so I am unable to do further testing on the Gabor features, mainly due to time constraints.
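For reference, a single Gabor response with scikit-image looks like this (one filter-bank entry only; the notebook's actual frequencies and orientations are not shown here, so these parameters are illustrative):

```python
import numpy as np
from skimage.filters import gabor

img = np.random.rand(120, 120)          # stand-in for a preprocessed X-ray
# one filter-bank entry; the full bank sweeps many frequencies and
# orientations over every image, which is what made the original run so slow
real, imag = gabor(img, frequency=0.2, theta=0)
```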

Conclusion

The Gabor feature extractor underperforms the other feature extractors we have looked at if we compare n_neighbors=1 and p=2: for the same parameters, the PCAs are in the range of 92% accuracy and DAISY is at 86%. In conclusion, PCA seems to perform best as a feature extractor; however, only a few DAISY architectures were tested. For DAISY, having sparse initial rings (red dots) with a wider radius seems to work best; further tests could be conducted with fewer, more widely spaced rings.

Citation


  1. Kermany, D. S., Goldbaum, M., Cai, W., Valentim, C. C., Liang, H., Baxter, S. L., . . . Zhang, K. (2018, February 22). Identifying medical diagnoses and treatable diseases by image-based Deep Learning. Cell, 172(5). doi:10.1016/j.cell.2018.02.010
  2. https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia
  3. https://github.com/eclarson/MachineLearningNotebooks/blob/master/04.%20Dimension%20Reduction%20and%20Images.ipynb